Add IcebergDocument as one implementation of VirtualDocument #3147

Open · bobbai00 wants to merge 54 commits into master from jiadong-add-file-result-storage

Conversation

@bobbai00 (Collaborator) commented Dec 10, 2024:

This PR introduces an implementation of result storage using Apache Iceberg.

How to enable the Iceberg result storage

In storage-config.yaml:

  • change result-storage-mode to iceberg
  • configure the storage.iceberg.catalog.jdbc section as shown below:
iceberg:
    catalog:
      jdbc: # currently we only support storing catalog info via jdbc, i.e. https://iceberg.apache.org/docs/1.7.1/jdbc/
        url: "jdbc:mysql://localhost:3306/texera_iceberg?serverTimezone=UTC"
        username: ""
        password: ""

Make sure the catalog database is reachable with the configured url, username, and password.
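
As a reference, here is a minimal sketch of how these three properties can be fed into Iceberg's JdbcCatalog, following the JDBC catalog documentation linked above. This is not the PR's code; the helper name, catalog name, and warehouse path are illustrative assumptions.

import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.CatalogProperties
import org.apache.iceberg.jdbc.JdbcCatalog
import scala.jdk.CollectionConverters._

// Hypothetical helper: builds a JDBC-backed Iceberg catalog from the three
// properties configured in storage-config.yaml.
def createJdbcCatalog(url: String, username: String, password: String): JdbcCatalog = {
  val catalog = new JdbcCatalog()
  // an empty Hadoop Configuration is sufficient for local file storage
  catalog.setConf(new Configuration())
  catalog.initialize(
    "texera_iceberg", // assumed catalog name
    Map(
      CatalogProperties.URI -> url,
      JdbcCatalog.PROPERTY_PREFIX + "user" -> username,
      JdbcCatalog.PROPERTY_PREFIX + "password" -> password,
      CatalogProperties.WAREHOUSE_LOCATION -> "/tmp/texera/iceberg" // assumed warehouse path
    ).asJava
  )
  catalog
}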

Major changes

  • Introduced IcebergDocument: a thread-safe implementation of VirtualDocument for storing and reading results in Iceberg tables.
  • Introduced IcebergTableWriter: an append-only writer for Iceberg tables with configurable buffer size.
  • Added support for new configuration properties under storage.iceberg to specify catalog and table settings.

Introduced Dependencies

In workflow-core, the following new dependencies are added:

  • Iceberg-related packages
  • Hadoop common. This dependency is needed only to pass compilation: in the source code of iceberg-parquet, at line 160, even though the file is not of type HadoopOutputFile, it still creates a Hadoop Configuration() as a placeholder. At runtime we have no dependency on Hadoop or HDFS. (A rough sbt sketch follows this list.)
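
For illustration only, the additions to workflow-core's build could look roughly like the sbt snippet below; the exact artifact list and versions are assumptions (the Iceberg version is inferred from the 1.7.1 docs link above).

// build.sbt (illustrative)
libraryDependencies ++= Seq(
  "org.apache.iceberg" % "iceberg-core"    % "1.7.1",
  "org.apache.iceberg" % "iceberg-data"    % "1.7.1",
  "org.apache.iceberg" % "iceberg-parquet" % "1.7.1",
  // required to compile against iceberg-parquet (it references Hadoop's Configuration);
  // no HDFS is used at runtime
  "org.apache.hadoop"  % "hadoop-common"   % "3.4.1"
)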

Overview of the behavior of IcebergDocument and IcebergTableWriter

  • IcebergDocument:

    • Handles reading and managing data in Iceberg tables.
    • Initializes the table during construction, creating it if it does not exist or overriding it if specified.
    • Supports iterator-based incremental read operations.
    • Thread-safe for read and clear operations.
  • IcebergTableWriter:

    • Writes data to Iceberg tables in an append-only manner.
    • Creates new Parquet files for every buffer flush, ensuring immutability.
    • Not thread-safe, so it should only be accessed by one thread at a time (see the write-path sketch after this list).
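
To make the buffer-flush behavior concrete, here is a sketch using plain Iceberg APIs (not the PR's classes) of what one flush by a writer amounts to: write an immutable Parquet data file, then commit it to the table with an append. The namespace, table name, and file name are illustrative, and catalog is the JDBC catalog from the earlier sketch.

// assume: val catalog: JdbcCatalog = createJdbcCatalog(url, username, password) // earlier sketch
import org.apache.iceberg.{PartitionSpec, Schema}
import org.apache.iceberg.catalog.{Namespace, TableIdentifier}
import org.apache.iceberg.data.GenericRecord
import org.apache.iceberg.data.parquet.GenericParquetWriter
import org.apache.iceberg.io.DataWriter
import org.apache.iceberg.parquet.Parquet
import org.apache.iceberg.types.Types

// a toy schema standing in for the operator's tuple schema
val schema = new Schema(
  Types.NestedField.required(1, "id", Types.LongType.get()),
  Types.NestedField.optional(2, "name", Types.StringType.get())
)

// create the (assumed) namespace and a table named after the storage key
catalog.createNamespace(Namespace.of("operator_results"))
val table = catalog.createTable(
  TableIdentifier.of("operator_results", "someStorageKey"),
  schema,
  PartitionSpec.unpartitioned()
)

// one immutable data file, e.g. "0_1.parquet" written by worker 0 on its first flush
val outputFile = table.io().newOutputFile(table.location() + "/data/0_1.parquet")
val dataWriter: DataWriter[GenericRecord] = Parquet
  .writeData(outputFile)
  .schema(schema)
  .createWriterFunc(messageType => GenericParquetWriter.buildWriter(messageType))
  .overwrite()
  .withSpec(PartitionSpec.unpartitioned())
  .build()

val record = GenericRecord.create(schema)
record.setField("id", 1L)
record.setField("name", "texera")
dataWriter.write(record)
dataWriter.close()

// committing the append makes the new data file visible to readers
table.newAppend().appendFile(dataWriter.toDataFile()).commit()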

How the result will be stored via Iceberg tables

  • Given a storage key, a table named after the key will be created.
  • To append tuples to the table, each worker appends immutable Parquet files to the table's data space using IcebergTableWriter. To avoid Parquet filename collisions, each worker prefixes the files it creates with ${workerIndex}_${fileIndex}, where workerIndex is the worker's index and fileIndex is a counter maintained by the writer that increases by 1 each time a new data file is created and flushed.
  • To read the tuples, the reader uses the iterator returned by IcebergDocument.get. This iterator can incrementally read new data while writers are appending tuples (see the read sketch after this list).
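
For the read side, here is a sketch (again not the PR's code) using the generic reader from iceberg-data, which is the kind of table scan the IcebergDocument iterator wraps; the namespace and table name are the same illustrative ones used above.

// assume: catalog is the JdbcCatalog built in the earlier sketch
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.data.{IcebergGenerics, Record}
import scala.jdk.CollectionConverters._

// load the table that was created for the storage key
val resultTable = catalog.loadTable(TableIdentifier.of("operator_results", "someStorageKey"))

// scans all data files committed as of the table's current snapshot; after a writer
// commits another append, call resultTable.refresh() and scan again to see the new tuples
val records: Iterable[Record] = IcebergGenerics.read(resultTable).build().asScala
records.foreach(println)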

@bobbai00 bobbai00 self-assigned this Dec 10, 2024

// Register a shutdown hook to delete the file when the JVM exits
sys.addShutdownHook {
withWriteLock {
@shengquan-ni (Collaborator) commented Dec 10, 2024:

why do we need a write lock, do we allow multiple writers to write the file?

@bobbai00 (PR author) replied:

This file is removed

override val bufferSize: Int = 1024

// Register a shutdown hook to delete the file when the JVM exits
sys.addShutdownHook {
A collaborator commented:

The lifecycle for this file is also not correct. This file is created by the computing unit JVM, which can be killed right after the execution.

@shengquan-ni (Collaborator) commented Dec 10, 2024:

I suggest we do a global cleanup on OpResultStorage level on top of #3146.

@bobbai00 (PR author) replied:

This file is removed.

@bobbai00 force-pushed the jiadong-add-file-result-storage branch 2 times, most recently from 6522779 to a83d779 on December 14, 2024 at 00:14
@bobbai00 force-pushed the jiadong-add-file-result-storage branch 2 times, most recently from 1edb551 to cef347b on December 21, 2024 at 02:56
@bobbai00 changed the title from "Add PartitionDocument and ItemizedFileDocument" to "Add IcebergDocument as one implementation of VirtualDocument that can be used to store operator results" on Dec 22, 2024
@bobbai00 changed the title to "Add IcebergDocument as one implementation of VirtualDocument" on Dec 22, 2024
catalog:
  jdbc: # currently we only support storing catalog info via jdbc, i.e. https://iceberg.apache.org/docs/1.7.1/jdbc/
    url: "jdbc:mysql://0.0.0.0:3306/texera_iceberg?serverTimezone=UTC"
    username: "root"
A collaborator commented:

make sure to clean up those usernames and passwords.

@bobbai00 (PR author) replied:

cleaned

import scala.jdk.CollectionConverters._

class IcebergDocument[T >: Null <: AnyRef](
    val catalog: Catalog,
A collaborator commented:

I think the catalog should be created/retrieved inside IcebergDocument to make it self-contained.

@bobbai00 (PR author) replied:

Changed. I added a singleton catalog instance and all IcebergDocuments will use that instance as the catalog.
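
For illustration, a tiny sketch of what such a singleton might look like; the object name and config accessors are hypothetical, and createJdbcCatalog is the helper sketched in the PR description above.

import org.apache.iceberg.catalog.Catalog

object IcebergCatalogInstance {
  // shared, lazily created catalog used by every IcebergDocument
  lazy val instance: Catalog = createJdbcCatalog(
    StorageConfig.icebergCatalogUrl,      // hypothetical config accessors
    StorageConfig.icebergCatalogUsername,
    StorageConfig.icebergCatalogPassword
  )
}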

# Conflicts:
#	core/amber/src/main/scala/edu/uci/ics/texera/workflow/WorkflowCompiler.scala
#	core/workflow-core/src/main/scala/edu/uci/ics/amber/core/storage/result/OpResultStorage.scala
#	core/workflow-operator/src/main/scala/edu/uci/ics/amber/operator/sink/managed/ProgressiveSinkOpExec.scala